Setting up

Below is the setup for this class (install packages, load libraries, import data).

Install Packages

Packages we need to install (remember to uncomment the lines before running)

# Set a CRAN mirror
options(repos = c(CRAN = "https://cran.r-project.org"))
# for Part 1: 
#install.packages("tidyverse")
#install.packages("palmerpenguins")

# for Part 2:
#install.packages("tmap")
#install.packages("sf")
#install.packages("RColorBrewer")

# for Part 3:
#install.packages("tidytext")
#install.packages("janeaustenr")
#install.packages("magick") 
#install.packages("devtools")

Load Libraries

Load the packages we need

# for part 1
library(tidyverse)
library(palmerpenguins)

#for part 2
library(tmap)
library(sf)
library(RColorBrewer)

# for part 3
library(tidytext) # to work with unstructured data
library(janeaustenr) # to fetch the dataset
library(magick) # to display images

Part 1: Quick Overview of Grammar of Graphics

For this first quick overview we are going to use our dear Palmer Penguins dataset.

Palmer Penguins is a great dataset for data exploration and visualisation, and generally a really good alternative to Iris or mtcars.

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The palmerpenguins package contains two datasets: penguins, a simplified version of the raw data, and penguins_raw, which contains the raw data.
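
A quick way to compare the two datasets once the package is loaded (a sketch; exact column counts may vary slightly between package versions):

```r
library(palmerpenguins)

dim(penguins)     # 344 penguins, 8 tidy columns
dim(penguins_raw) # the same 344 penguins, 17 raw columns
names(penguins)   # the simplified column names we will use below
```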

Grammar of Graphics Levels

  • A set of rules to facilitate data visualisation developed by Leland Wilkinson.
  • It allows us to think of plot generation as following a step-by-step recipe (and like all good recipes, it can be modified when needed).
  • For a full overview, check out Hadley Wickham’s free textbook on using ggplot.
  • For inspiration on more complex plots and advanced techniques, Cédric Scherer’s blog is full of great ideas!

The recipe of a nice plot includes:

  1. The data
  2. Aesthetic Mapping
  3. Geometric Objects
  4. Facets (Subplotting)
  5. Statistical Transformations
  6. Scaling and Changing Coordinates
  7. Themes
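
As a preview, here is a hypothetical skeleton showing how the seven levels stack up in one layered call, using the penguins data (we will build each level step by step below):

```r
library(tidyverse)
library(palmerpenguins)

ggplot(data = penguins,                                # 1. the data
       aes(x = flipper_length_mm,
           y = body_mass_g,
           color = species)) +                         # 2. aesthetic mapping
  geom_point() +                                       # 3. geometric objects
  facet_wrap(~species) +                               # 4. facets
  stat_summary(fun.data = mean_se, color = "red") +    # 5. statistical transformations
  coord_cartesian() +                                  # 6. coordinates
  theme_bw()                                           # 7. themes
```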

Building our Plot

  • Our first call is to use the function ggplot()
  • By itself, it looks like this:
ggplot()

Level 1: Data

The first level is just the data we are going to use

  • From there, we need to specify the data we need.

  • We can feed in the data as it is.

However, at this stage our plot is still blank.

# Regular data
ggplot(data = penguins)

Level 2: Aesthetics

  • Our plots are empty as we’ve not told R what variables to use.
  • Thankfully, the command is very simple, using the aes() command after specifying the data.
  • Here we simply label our x and y coordinates.
  • But as we will see, we still don’t have any visual cues in the plot itself.
head(penguins, 20)
## # A tibble: 20 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## 11 Adelie  Torgersen           37.8          17.1               186        3300
## 12 Adelie  Torgersen           37.8          17.3               180        3700
## 13 Adelie  Torgersen           41.1          17.6               182        3200
## 14 Adelie  Torgersen           38.6          21.2               191        3800
## 15 Adelie  Torgersen           34.6          21.1               198        4400
## 16 Adelie  Torgersen           36.6          17.8               185        3700
## 17 Adelie  Torgersen           38.7          19                 195        3450
## 18 Adelie  Torgersen           42.5          20.7               197        4500
## 19 Adelie  Torgersen           34.4          18.4               184        3325
## 20 Adelie  Torgersen           46            21.5               194        4200
## # ℹ 2 more variables: sex <fct>, year <int>
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g))  # setting x and y

Level 3: Geometries

  • Here we link the coordinates from our data to visual geometries.
  • We use geoms (geometric functions) to decide how to shape the coordinates.
  • This allows us to shape our plot with our determined coordinates.
  • We can also apply statistical functions with the geoms.
  • We can place multiple geom layers.
  • The order of our layers is determined by the order we code.
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g)) +
  geom_point()

Now that we can finally see something, let’s refine the coordinates/aesthetics a bit more by playing with colours and shapes.

ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
  geom_point()

It is not always the best idea, but we can also use size to add an additional layer of information.

ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species, size=flipper_length_mm)) +
  geom_point()

N.B. If you want colour and size to be connected to a variable (i.e. to be part of the legend), you set them up within the aesthetics; otherwise they go in the geometry layer.

ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
  geom_point(size=4)

Level 4: Facet

If we want to subdivide the plot into more subplots, we can use facet_wrap or facet_grid.

facet_wrap is used when you want to facet by a single variable.

ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
  geom_point()+
  facet_wrap(~species)

facet_grid is used if you want to create an array of panels across two variables.

ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
  geom_point()+
  facet_grid(sex~species)

Level 5: Adding statistical layers

  • Here we can visualise any model we’ve used in our analyses.
  • This is a vital step for communicating our research.
  • It’s also a fundamental step in validating our findings.
  • R has so many options for visualising our models! From loess and linear models, to means and standard deviations.
ggplot(data = penguins,
  aes(x = species, y = flipper_length_mm))+
  geom_boxplot()+ 
  stat_summary(fun.data = mean_se,
    color = "red") # add the mean and standard error in red

Level 6: Coordinates

On this level you can set the attributes of the coordinates, change the scale, or transform the coordinate system (e.g. flip the axes).

ggplot(data = penguins,
  aes(x = species, y = flipper_length_mm))+
  geom_boxplot()+ 
  stat_summary(fun.data = mean_se,
    color = "red")+
  coord_flip()

Level 7: Themes and Global Settings

On this level you can set everything that is not connected directly with the data, from background to colours and axis labels.

Backgrounds

Change the theme to black and white

ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
  geom_point(size=3)+
  theme_bw()

Labels

Add a title, subtitle, caption, and new axis labels

ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
  geom_point(size=3) +
  theme_bw() +
  labs(title = "New plot title",
       subtitle = "A subtitle",
       caption = "(based on data from ...)",
       x = "New x label",
       y = "New y label",
       color = "Colours")

Colours

This is my favourite rabbit hole.

  • Some journals will require certain stylistic designs for your figures.
  • This will include colour schemes, but also includes fonts and other design features.
  • Thankfully ggplot allows us to control these aspects of the plot with full customisation options as well.

Pre-made colour and theme options also exist:

In the interest of time we are going to see just one simple example, but really the sky is the limit; you can start by having a look at this wiki.

ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g, color=species)) +
  geom_point(size=3) +
  theme_bw() +
  labs(title = "New plot title",
       subtitle = "A subtitle",
       caption = "(based on data from ...)",
       x = "New x label",
       y = "New y label",
       color = "Colours") +
  scale_colour_manual(values = c("darkorange","purple","cyan4"))

N.B. Depending on the graph type, the relevant aesthetic could be fill or colour.
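
For instance, a boxplot is a filled shape, so the equivalent customisation maps fill instead of colour and uses scale_fill_manual() (a sketch reusing the penguins data):

```r
library(tidyverse)
library(palmerpenguins)

# map fill (not colour) because boxplots are filled geometries
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  scale_fill_manual(values = c("darkorange", "purple", "cyan4"))
```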

Exercise 1:

Create a visualisation using the penguins dataset that shows the relationship between bill length and bill depth across the different species of penguins and the different sexes. Title the graph “My Penguin Graph” and convert the units of measurement to cm (Tip: you can divide x and y directly in the aesthetics and use the labs layer).

ggplot( )

OK, now that we have covered some basics, let’s focus on something more fun: first some geographical plotting, and then some sentiment analysis results.

Part 2: Geographical Data

For this part, focused on the history of Scotland, we are going to explore the Statistical Accounts of Scotland. More information on the dataset can be found on the StatAccount website. The ‘Old’ Statistical Account (1791-99), compiled under the direction of Sir John Sinclair of Ulbster, and the ‘New’ Statistical Account (1834-45) are reports of life in Scotland during the 18th and 19th centuries.

They offer uniquely rich and detailed parish reports for the whole of Scotland, covering a vast range of topics including agriculture, education, trades, religion and social customs.

We are also going to use edited data from the National Records of Scotland

Import Data

Import the data that we will need

Import the Statistical Account Data

First the CSV containing the text of the StatAccount

Parish <- read_csv("data/parish.csv")
## Rows: 27065 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): title, text, Type, TypeDescriptive, RecordID, Area, Parish
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(Parish)
##     title               text               Type           TypeDescriptive   
##  Length:27065       Length:27065       Length:27065       Length:27065      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    RecordID             Area              Parish         
##  Length:27065       Length:27065       Length:27065      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

Import the Geographical Data

Then we import the first GeoPackage. A GeoPackage is an open, standards-based format designed for the efficient storage, transfer, and exchange of geospatial data. Developed by the Open Geospatial Consortium (OGC), it serves as a container for various types of geospatial information, including vector features, raster maps, and attribute data, all within a single file (https://www.geopackage.org/).

The st_read() function:

  • from the sf package; reads vector spatial data.
  • dsn = data source name, essentially the file name and the folder path.

ParishesGeo <- st_read(dsn = "data/Spatial/Parishes.gpkg")
## Multiple layers are present in data source C:\Users\lmichiel\Documents\GitHub\DH-RSESummerSchool2024\day 1\DataVisWithR\data\Spatial\Parishes.gpkg, reading layer `civilparish_pre1891'.
## Use `st_layers' to list all layer names and their type in a data source.
## Set the `layer' argument in `st_read' to read a particular layer.
## Reading layer `civilparish_pre1891' from data source 
##   `C:\Users\lmichiel\Documents\GitHub\DH-RSESummerSchool2024\day 1\DataVisWithR\data\Spatial\Parishes.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 35 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 5550.178 ymin: 530264.1 xmax: 469816.2 ymax: 1220373
## Projected CRS: OSGB36 / British National Grid
plot(ParishesGeo, main = "Scottish Parishes")
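
As the messages above suggest, you can list every layer stored in a GeoPackage with st_layers() and select one explicitly via the layer argument of st_read() (a sketch, assuming the same file path):

```r
library(sf)

# list all the layers (and their types) in the GeoPackage
st_layers(dsn = "data/Spatial/Parishes.gpkg")

# read a specific layer explicitly instead of relying on the default
ParishesGeo <- st_read(dsn = "data/Spatial/Parishes.gpkg",
                       layer = "civilparish_pre1891")
```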

As you can see from the plot, the dataset is made up of vector polygons. You can also change the basic presentation, such as the fill colour, line width, and border colour.

plot(ParishesGeo,
     col = "black",
     lwd = 1,
     border = "white",
     main = "Scottish Parishes")

Besides the parish boundaries, we will also need the locations of distilleries across Scotland. Load the geospatial information for the distilleries: this is a vector point dataset.

PointsDistilleries<- st_read(dsn = "data/Spatial/ScottishDistilleries.gpkg")
## Reading layer `scotdistilleries' from data source 
##   `C:\Users\lmichiel\Documents\GitHub\DH-RSESummerSchool2024\day 1\DataVisWithR\data\Spatial\ScottishDistilleries.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 109 features and 20 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 126650.9 ymin: 554418.5 xmax: 412088.9 ymax: 1010713
## Projected CRS: OSGB36 / British National Grid
plot(PointsDistilleries,
     main = "Scottish Distilleries")
## Warning: plotting the first 10 out of 20 attributes; use max.plot = 20 to plot
## all

We will work first with the vector polygons containing the parish boundaries. At the moment, the vector polygon dataset contains only the info from the GeoPackage. To add the info from the parish dataset (i.e. the information from the CSV file) we need to merge the GeoPackage with it.

Work on Illness Mentions

Extract Information from the textual data

Because we want to see how often mentions of a certain topic are present in the text, we search for specific keywords.

The first topic we are going to look at is illness. We create a new variable that contains "yes" if the text contains one of the keywords and "no" if it does not.

1. Search the keywords

Parish$Ilness<- ifelse(grepl("ill|ilness|sick|cholera|smallpox|plague|cough|typhoid|fever|measles|dysentery", Parish$text,
                             ignore.case = T), "yes","no")

head(Parish$Ilness)
## [1] "yes" "yes" "yes" "yes" "no"  "yes"
  2. Group by illness and geographical area

To do this we use a pipe. If you have never seen a pipe before, it is basically a way to perform a series of actions on a dataset in a certain order (you can think of it as a bullet-point list of actions).
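
If pipes are new to you, here is a minimal warm-up reusing the penguins data from Part 1; each %>% feeds the result of one step into the next:

```r
library(tidyverse)
library(palmerpenguins)

penguins %>%
  filter(!is.na(body_mass_g)) %>%            # 1. drop missing weights
  group_by(species) %>%                      # 2. group by species
  summarise(mean_mass = mean(body_mass_g))   # 3. average weight per group
```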

IlnessGroup <- Parish %>%
  group_by(Area) %>%
  summarise(Total = n(),
            count = sum(Ilness == "yes")) %>%
  mutate(per = round(count/Total, 2))

head(IlnessGroup)
## # A tibble: 6 × 4
##   Area     Total count   per
##   <chr>    <int> <int> <dbl>
## 1 Aberdeen  2193  1790  0.82
## 2 Argyle    1336  1093  0.82
## 3 Ayrshire  1493  1312  0.88
## 4 Banff      820   697  0.85
## 5 Berwick    676   571  0.84
## 6 Bute       147   128  0.87

3. Merge the two datasets

MergedGeo <-merge(ParishesGeo,IlnessGroup,
                  by.x="JOIN_NAME_",
                  by.y="Area",
                  all.x = TRUE) # N.B. a left join, because we want to preserve all the records present in ParishesGeo

4. Check the data have merged properly

head(MergedGeo)
## Simple feature collection with 6 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 92508.41 ymin: 571189.1 xmax: 414043.2 ymax: 868786.8
## Projected CRS: OSGB36 / British National Grid
##   JOIN_NAME_ Total count  per                       geometry
## 1   Aberdeen  2193  1790 0.82 MULTIPOLYGON (((394004.6 80...
## 2     Argyle  1336  1093 0.82 MULTIPOLYGON (((183373.2 73...
## 3   Ayrshire  1493  1312 0.88 MULTIPOLYGON (((218097.2 65...
## 4      Banff   820   697 0.85 MULTIPOLYGON (((316487.5 83...
## 5    Berwick   676   571 0.84 MULTIPOLYGON (((372205.7 66...
## 6       Bute   147   128 0.87 MULTIPOLYGON (((206008.4 63...

Visualise the new dataset

1. Create a continuous colour palette

color.palette <- colorRampPalette(c("white", "red"))
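
Note that colorRampPalette() returns a function, not a vector of colours; calling that function with a number n generates n interpolated colours:

```r
color.palette <- colorRampPalette(c("white", "red"))

color.palette(3) # three hex codes interpolated from white to red
```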
  2. Spatial plot using tmap

tm_shape is a function in the tmap package (Thematic maps). Thematic maps can be generated with great flexibility. The syntax for creating plots is similar to that of ggplot2, but tailored to maps. To plot a tmap, you first need to specify tm_shape; layers can then be added with the + operator. tm_fill specifies the presentation of the polygons. To differentiate NA values from other valid entries, colorNA is added.

  • palette = color.palette(100): specifies the colours used to fill the polygons, here generating a palette with 100 distinct colours.
tm_shape(MergedGeo) + # Specify the spatial object (MergedGeo) to be used in the map
  tm_fill("per", palette = color.palette(100), colorNA = "grey") + # Fill polygons based on 'per' variable, using a custom color palette with 100 colors; grey for NA values
  tm_borders(col = "black") + # Add black borders to each polygon
  tm_layout(title = "Illness report", legend.text.size = 0.75, legend.title.size = 1, frame = FALSE) # Set layout: add a title, resize legend text and title, remove frame

Work with map colours

Let’s try changing the colour of the filled regions using predefined colours. There are predefined colour palettes you can use directly; commonly used ones include rainbow(), heat.colors(), topo.colors(), and terrain.colors(). Beware of the representation of colours: you might need to reverse the colour band to make the representation more intuitive.

tm_shape(MergedGeo) + # Specify the spatial object (MergedGeo) to be used in the map
  tm_fill("per", palette = rev(heat.colors(100)), colorNA = "grey") + # Fill polygons based on 'per' variable, using a reversed heat.colors palette with 100 colors; grey for NA values
  tm_borders(col = "black") + # Add black borders to each polygon
  tm_layout(title = "Illness report", legend.text.size = 0.75, legend.title.size = 1, frame = FALSE) # Set layout: add a title, resize legend text and title, remove frame

You could also change the colour using RColorBrewer

display.brewer.all() # show all the palettes in RColorBrewer

color.palette <- brewer.pal(n = 9, name = "YlOrRd") # create a new tailored palette

We can now replot using the new palette.

tm_shape(MergedGeo) + # Specify the spatial object (MergedGeo) to be used in the map
  tm_fill("per", palette = color.palette, colorNA = "grey") + # Fill polygons based on 'per' variable, using a custom color palette (color.palette); grey for NA values
  tm_borders(col = "black") + # Add black borders to each polygon
  tm_layout(title = "Illness report", legend.text.size = 0.75, legend.title.size = 1, frame = FALSE) # Set layout: add a title, resize legend text and title, remove frame

Exercise 2

Try to re-plot the map using a different colour range. Add your code below.

Work on the legend intervals

Change the spacing of the intervals. The number of classes can be keyed in directly using n, and style changes the type of breaks:

1. “fixed”: User-defined fixed breaks.

2. “pretty”: Breaks at pretty intervals (often used for visual appeal).

3. “quantile”: Breaks at quantile intervals (each class has an equal number of observations).

4. “equal”: Breaks at equal intervals.

5. “kmeans”: Breaks determined by k-means clustering.

6. “hclust”: Breaks determined by hierarchical clustering.

8. “bclust”: Breaks determined by bagged clustering.

8. “fisher”: Breaks determined by Fisher-Jenks natural breaks optimization.

9. “jenks”: Another name for Fisher-Jenks breaks.

10. “sd”: Breaks determined by standard deviations from the mean.

11. “log10_pretty”: Breaks determined by log10 transformed values with pretty intervals.

12. “cont”: Continuous color scale (no discrete breaks).

tm_shape(MergedGeo) + # Specify the spatial object (MergedGeo) to be used in the map
  tm_fill("per", style = "equal", n = 10, palette = color.palette, colorNA = "grey") + # Fill polygons based on 'per' variable; use equal interval classification with 10 classes; custom color palette; grey for NA values
  tm_borders(col = "black") + # Add black borders to each polygon
  tm_layout(title = "Illness report", legend.text.size = 0.75, legend.title.size = 1, frame = FALSE, legend.position = c(1, 0.5)) # Set layout: add a title, resize legend text and title, remove frame, position legend at (1, 0.5)

Exercise 3:

Try adjusting these values and explore the effects. Write your code below.

Now we can work on a different subject: Witches

The steps are always the same: first we search for keywords, then we group the results, and finally we merge them with our map of Scotland.

If you want to try it out yourself, instead of looking at the code below, try to replicate the steps we followed for the illness data here:

#Parish$witches<-ifelse

Preparing the data set

Parish$witches<- ifelse(grepl("witch|spell|witches|enchantment|magic", Parish$text, ignore.case = T), "yes","no")

Can you think of other keywords? Just add them to the code above.

Then we group by area

WitchGroup <- Parish %>%
  group_by(Area) %>%
  summarise(Total = n(), count = sum(witches == "yes")) %>%
  mutate(per = round(count / Total, 2))

And finally we merge

MergedGeo2 <-merge(ParishesGeo,WitchGroup, by.x="JOIN_NAME_", by.y="Area", all.x = TRUE) # N.B. a left join, because we want to preserve all the records present in ParishesGeo

Let’s create a more “witchy” palette

color.palette2 <- colorRampPalette(c("white", "purple"), alpha = 0.5)

Plot the result

tm_shape(MergedGeo2) +
  tm_fill("per", palette = color.palette2(100), colorNA = "grey") +
  tm_borders(col = "black")+
  tm_layout(title = "Witchcraft report",
            legend.text.size = 0.75,
            legend.title.size = 1,
            frame = FALSE)

Refine the results: adding a scale bar and north arrow

Adding a scale bar and north arrow to the map is very simple using tmap.

tm_shape(MergedGeo2) +
  tm_fill("per",
          style = "equal",
          n = 5,
          palette = color.palette2(100),
          colorNA = "grey") +
  tm_borders(col = "black")+
  tm_layout(title = "Witches Reports",
            legend.text.size = 0.75,
            legend.title.size = 1,
            frame = FALSE) +
  tm_scale_bar(position = "left") + #add scalebar
  tm_compass(size = 1.5)#add north arrow

Whisky consumption

Let’s connect back to one of the main topics of this week and look at whisky consumption across Scotland.

Unsurprisingly the first steps remain the same.

If you want to try it out yourself, instead of looking at the code below, try to replicate the steps we have done before here:

#Parish$Booze<-ifelse
  1. Search the keywords
Parish$Booze<- ifelse(grepl("illicit still|illicit distillery|drunk|intemperance|wisky|whisky|whiskey|whysky |alembic",Parish$text, ignore.case = T), "yes","no")
  2. Group by the new column and area
BoozeGroup <- Parish %>%
  group_by(Area) %>%
  summarise(Total = n(), count = sum(Booze == "yes")) %>%
  mutate(per = round(count / Total, 2))
  3. Merge back
MergedGeo3 <-merge(ParishesGeo,BoozeGroup, by.x="JOIN_NAME_", by.y="Area",all.x = TRUE) # N.B. a left join, because we want to preserve all the records present in ParishesGeo
  4. Create a palette
color.palette3 <- colorRampPalette(c("white", "brown"))
  5. Plot with tmap
tm_shape(MergedGeo3) +
  tm_fill("per",
          style = "equal",
          n = 5,
          palette = color.palette3(100),
          colorNA = "grey") +
  tm_borders(col = "black")+
  tm_layout(title = "Whisky Reports",
            legend.text.size = 0.75,
            legend.title.size = 1,
            frame = FALSE) +
  tm_scale_bar(position = "left") +
  tm_compass(size = 1.5)

Work with multiple datasets

Add the second dataset, i.e. the point dataset with the locations of the modern-day distilleries.

tm_shape(MergedGeo3) +
  tm_fill("per",
          style = "equal",
          n = 5,
          palette = color.palette3(100),
          colorNA = "grey") +
  tm_borders(col = "black")+
  tm_layout(title = "Whisky Reports",
            legend.text.size = 0.75,
            legend.title.size = 1,
            frame = FALSE) +
  tm_scale_bar(position = "left") +
  tm_compass(size = 1.5)+
  tm_shape(PointsDistilleries) + # add our new dataset
  tm_dots(size=0.1,
          col="black", # this time they are dots rather than polygon fills
          colorNA = NULL)

We can also use bespoke symbols for the distillery locations and plot it again.

icon <- tmap_icons("data/bottle.png") 

tm_shape(MergedGeo3) + 
  # Fill the polygons based on the "per" attribute
  tm_fill("per", 
          style = "equal",  # Use equal interval breaks
          n = 5,  # Number of classes to divide the data into
          palette = color.palette3(100),  # Color palette with 100 color levels
          colorNA = "grey") +  # Color for missing values
  # Add borders to the polygons
  tm_borders(col = "black") +
  # Add another spatial object to the map
  tm_shape(PointsDistilleries) +
  # Add symbols (icons) for the spatial points
  tm_symbols(size = 0.3, # Size of the symbols
             clustering = TRUE, # Cluster overlapping symbols
             shape = icon, # Symbol shape (specified by the 'icon' variable)
             border.lwd = 0) + # Border width of the symbols
  # Add layout elements like title and legend settings
  tm_layout(title = "Booze Reports",  # Title of the map
            legend.text.size = 0.75,  # Size of the legend text
            legend.title.size = 1,  # Size of the legend title
            frame = FALSE) +  # Do not draw a frame around the map
  # Add a scale bar to the map
  tm_scale_bar(position = "left") +  # Position of the scale bar
  # Add a compass to the map
  tm_compass(size = 1.5)  # Size of the compass

Exercise 5:

Search for a different topic in the dataset and create a new visualisation

Part 3: Working with Sentiment Analysis Data

Download and look at the Bing sentiment lexicon

get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows

This is a short demo to show how you can create a .gif that records the evolution of sentiment across Jane Austen’s books. To do so we need to follow these steps:

  1. Create a table where we automatically extract each word in Jane Austen’s books.
  2. Calculate the sentiment values for subsets of each chapter.
  3. Plot a graph for each book that will show the sentiment values.
  4. Collate the single graphs into a single gif.

1. Create a word collection table

For each word we are going to collect information about:

  • the book it belongs to
  • the line number where it can be found
  • the chapter where it can be found
  • the word itself
AustenTable <- austen_books() %>% #create a new file named AustenTable that will extract info from austen_books
  group_by(book) %>% # group by every single book then
  mutate( # manipulate the data to create
    linenumber = row_number(), # a line number column that would count in which row the word was
    chapter = cumsum(str_detect(text, # the chapter number: use a regex to find lines starting with "chapter" followed by a space and a digit or Roman numeral
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>% #  This line removes the grouping, so subsequent operations will be applied to the entire dataset rather than grouped subsets.
  unnest_tokens(word, text) # This tokenises the text column, splitting it into individual words and creating a new row for each word

head(AustenTable)
## # A tibble: 6 × 4
##   book                linenumber chapter word       
##   <fct>                    <int>   <int> <chr>      
## 1 Sense & Sensibility          1       0 sense      
## 2 Sense & Sensibility          1       0 and        
## 3 Sense & Sensibility          1       0 sensibility
## 4 Sense & Sensibility          3       0 by         
## 5 Sense & Sensibility          3       0 jane       
## 6 Sense & Sensibility          3       0 austen

2. Calculate the sentiment values

Now that we have the list of all the words, we extract the average sentiment of subsets of each chapter of each book. To do so, as before, we manipulate our dataset with a pipe.

Because we want to create a uniform pattern that simulates a tapestry, we divide each chapter into equal collections of words.

jane_austen_sentiment <- AustenTable %>% # Load Jane Austen's books dataset and start a chain of operations to create a Jane_austen_sentiment dataset
  inner_join(get_sentiments("bing"), relationship = 'many-to-many') %>% # Join the dataset with the Bing lexicon sentiment dictionary
  group_by(book, chapter) %>%   # Group the dataset by book and chapter
  mutate(index = rep(1:10, each = ceiling(n() / 10), length.out = n())) %>% # Create an index that splits each chapter into 10 segments, no matter how long the chapter is
  group_by(book, chapter, index) %>% # Regroup the dataset by book, chapter, and index
  count(sentiment) %>%  # Count the occurrences of each sentiment within each segment
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% # Reshape the data from long to wide format
  mutate(sentiment = positive - negative,index= as.factor(index))%>%  # Calculate sentiment score (positive - negative) for each segment
  filter(!chapter=="0") # Filter out chapters with the value "0" (if any)
## Joining with `by = join_by(word)`
head(jane_austen_sentiment)
## # A tibble: 6 × 6
## # Groups:   book, chapter, index [6]
##   book                chapter index negative positive sentiment
##   <fct>                 <int> <fct>    <int>    <int>     <int>
## 1 Sense & Sensibility       1 1            2       11         9
## 2 Sense & Sensibility       1 2            4        9         5
## 3 Sense & Sensibility       1 3            7        6        -1
## 4 Sense & Sensibility       1 4            3       10         7
## 5 Sense & Sensibility       1 5            4        9         5
## 6 Sense & Sensibility       1 6            3       10         7

3. Plot a graph for each book

Plot a graph for each book that will show the sentiment-value tapestry across the book. To do so we need to complete a series of small steps.

A. Create a directory to which the images will be written

dir_out <- file.path("outputs/Austen") # Define the directory path where the outputs will be saved
dir.create(dir_out, recursive = TRUE) # Create the directory if it doesn't already exist
## Warning in dir.create(dir_out, recursive = TRUE): 'outputs\Austen' already
## exists

B. List all the books that are inside our dataset

books <- unique(jane_austen_sentiment$book)
books 
## [1] Sense & Sensibility Pride & Prejudice   Mansfield Park     
## [4] Emma                Northanger Abbey    Persuasion         
## 6 Levels: Sense & Sensibility Pride & Prejudice Mansfield Park ... Persuasion

C. Find which is the max number of chapters each book has

most_chapter <- max(jane_austen_sentiment$chapter, na.rm = TRUE)# Find the maximum chapter number in the dataset 'jane_austen_sentiment'
most_chapter
## [1] 61

D. Print a graph for each book using a for loop

Now we have everything we need to create a for loop that will automatically generate a graph for each book. A for loop is a control flow statement that lets you repeatedly execute a block of code a specified number of times, or iterate over a sequence of values.

for (y in books) { # Iterate over each book in the 'books' vector; y is the name we give to the iteration variable. It can be anything, as long as you use it consistently.
  
  p <- # p is just a name we give to the plot; again, you can change it as long as you are consistent
    jane_austen_sentiment %>%
    filter(book == y) %>% # Filter the 'jane_austen_sentiment' dataset for the current book
    ggplot(aes(chapter,index, fill= sentiment)) +  # Create a ggplot object with chapter, index, and sentiment as aesthetics
    geom_tile() +# Add a tile layer to create a heatmap
    scale_x_continuous(breaks=seq(1,most_chapter,1), expand = c(0,0))+  # Customise x-axis scale to show breaks from 1 to 'most_chapter'
    scale_fill_gradient(low="blue", high="red", limits = c(-20, 40))+# Customise fill scale to use a gradient from blue to red
    theme_bw()+ # Apply a black-and-white theme
    guides(fill="none")+  # Remove the fill legend
    ggtitle(y)+ # Add a title to the plot with the current book's name
    coord_fixed(ratio = 1, ylim = c(10,1), xlim = c(0.5,most_chapter+0.5))+  # Fix the aspect ratio and set limits for y and x axes
    theme(  # Customize theme to remove y-axis labels and ticks
      axis.title.y = element_blank(),
      axis.text.y= element_blank(),
      axis.ticks.y = element_blank()
    )
  
  fp <- file.path(dir_out, paste0(y, ".png"))# Define the file path where the plot will be saved
  
  ggsave(plot = p,  # Save the ggplot object as a PNG file
         filename = fp,   # File path where the plot will be saved
         device = "png", # Output device type (PNG format)
         width=3500,# Width of the output in pixels
         height = 1000, # Height of the output in pixels
         units = "px") # Units of width and height (pixels)
  
}# Close the loop

The bit of code below is just to display one of the plots we created.

Image <- image_read('outputs/Austen/Emma.png')
Image

Good! We are almost there; now we need to create a gif out of the single plots.

4. Create and Save a gif

A. List file names and read in

imgs <- list.files(dir_out, full.names = TRUE) # List all file names in the directory 'dir_out' and store them in 'imgs'
img_list <- lapply(imgs, image_read)  # Read each image file from the list of file names using 'image_read' and store them in 'img_list'

B. Join the images together

img_joined <- image_join(img_list) # Join the list of images into a single animated image using 'image_join'

C. Animate 1 frame per second

img_animated <- image_animate(img_joined, fps = 1) # Create an animated image from the joined image with a frame rate of 1 frame per second using 'image_animate'

D. Save to disk

image_write(image = img_animated,
            path = "outputs/austen.gif")

Let’s look at what we have done

img_animated

Exercise 6:

Create a similar visualisation using a different dataset. You can have a look at what is directly available in R here

Hint version using Sherlock books

Solution

devtools::install_github("EmilHvitfeldt/sherlock")
## Using GitHub PAT from the git credential store.
## Skipping install of 'sherlock' from a github remote, the SHA1 (38584034) has not changed since last install.
##   Use `force = TRUE` to force installation
library(sherlock)

SherlockTable <- holmes %>% # create a new object named SherlockTable that extracts info from the holmes dataset
  group_by(book) %>% # group by every single book then
  mutate( # manipulate the data to create
    linenumber = row_number(), # a line number column that would count in which row the word was
    chapter = cumsum(str_detect(text, # the chapter number: use a regex to find lines starting with "chapter" (case-insensitive)
                                regex("^CHAPTER", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>% #  This line removes the grouping, so subsequent operations will be applied to the entire dataset rather than grouped subsets.
  unnest_tokens(word, text) # This tokenises the text column, splitting it into individual words and creating a new row for each word

head(SherlockTable)
## # A tibble: 6 × 4
##   book               linenumber chapter word   
##   <chr>                   <int>   <int> <chr>  
## 1 A Study In Scarlet          1       0 a      
## 2 A Study In Scarlet          1       0 study  
## 3 A Study In Scarlet          1       0 in     
## 4 A Study In Scarlet          1       0 scarlet
## 5 A Study In Scarlet          3       0 table  
## 6 A Study In Scarlet          3       0 of
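As a quick aside, the tokenisation step can be tried on a tiny, self-contained tibble (the sentence below is just an invented example):

```r
library(tibble)
library(tidytext)

# One row of raw text; unnest_tokens() splits it into one row per word
toy <- tibble(line = 1, text = "It was a dark and stormy night")
unnest_tokens(toy, word, text)  # words are lowercased; other columns (line) are kept
```

The result has seven rows, one per word, which is exactly the shape the sentiment join below expects.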
Sherlock_sentiment <- SherlockTable %>% 
  inner_join(get_sentiments("bing"), relationship = 'many-to-many') %>% # Join the dataset with the Bing lexicon sentiment dictionary
  group_by(book, chapter) %>%   # Group the dataset by book and chapter
  mutate(index = rep(1:10, each = ceiling(n() / 10), length.out = n())) %>% # Create an index that splits each chapter into 10 segments, no matter how long the chapter is
  group_by(book, chapter, index) %>% # Regroup the dataset by book, chapter, and index
  count(sentiment) %>%  # Count the occurrences of each sentiment within each segment
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% # Reshape the data from long to wide format
  mutate(sentiment = positive - negative,index= as.factor(index))%>%  # Calculate sentiment score (positive - negative) for each segment
  filter(!chapter=="0") # Filter out chapter 0 (text that appears before the first chapter heading)
## Joining with `by = join_by(word)`
head(Sherlock_sentiment)
## # A tibble: 6 × 6
## # Groups:   book, chapter, index [6]
##   book                 chapter index negative positive sentiment
##   <chr>                  <int> <fct>    <int>    <int>     <int>
## 1 A Scandal in Bohemia       4 1            8       13         5
## 2 A Scandal in Bohemia       4 2            9       12         3
## 3 A Scandal in Bohemia       4 3           11       10        -1
## 4 A Scandal in Bohemia       4 4            9       12         3
## 5 A Scandal in Bohemia       4 5           11       10        -1
## 6 A Scandal in Bohemia       4 6            9       12         3
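The `rep(1:10, each = ceiling(n() / 10), length.out = n())` trick above is worth a closer look. Here is a standalone sketch, where `n <- 23` is just an invented chapter length:

```r
# Sketch of the segment-index trick: assign n rows to (at most) 10 segments
n <- 23                                            # invented chapter length
idx <- rep(1:10, each = ceiling(n / 10), length.out = n)
idx
# With each = 3, segments 1-7 get 3 words, segment 8 gets the remaining 2,
# and segments 9-10 stay empty
table(idx)
```

Note that for chapters whose length is not close to a multiple of 10, the trailing segments can end up empty, so some tiles in the heatmap may be missing rather than zero.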
dir_outS <- file.path("outputs/Sherlock") # Define the directory path where the outputs will be saved
dir.create(dir_outS, recursive = TRUE) # Create the directory if it doesn't already exist
## Warning in dir.create(dir_outS, recursive = TRUE): 'outputs\Sherlock' already
## exists
books2 <- unique(Sherlock_sentiment$book)
books2 
## [1] "A Scandal in Bohemia"            "A Study In Scarlet"             
## [3] "The Adventure of Wisteria Lodge" "The Adventure of the Red Circle"
## [5] "The Hound of the Baskervilles"   "The Sign of the Four"           
## [7] "The Valley Of Fear"
most_chapter <- max(Sherlock_sentiment$chapter, na.rm = TRUE)# Find the maximum chapter number in the dataset 
most_chapter
## [1] 15
for (y in books2) { # Iterate over each book in the 'books2' vector; y is the iteration variable and can be named anything, as long as you use it consistently.
  
  p <- # p is just a name we give to the plot; again, you can change it as long as you are consistent
    Sherlock_sentiment %>%
    filter(book == y) %>% 
    ggplot(aes(chapter,index, fill= sentiment)) +  # Create a ggplot object with chapter, index, and sentiment as aesthetics
    geom_tile() +# Add a tile layer to create a heatmap
    scale_x_continuous(breaks=seq(1,most_chapter,1), expand = c(0,0))+  # Customise x-axis scale to show breaks from 1 to 'most_chapter'
    scale_fill_gradient(low="blue", high="red", limits = c(-20, 40))+# Customise fill scale to use a gradient from blue to red
    theme_bw()+ # Apply a black-and-white theme
    guides(fill="none")+  # Remove the fill legend
    ggtitle(y)+ # Add a title to the plot with the current book's name
    coord_fixed(ratio = 1, ylim = c(10,1), xlim = c(0.5,most_chapter+0.5))+  # Fix the aspect ratio and set limits for y and x axes
    theme(  # Customize theme to remove y-axis labels and ticks
      axis.title.y = element_blank(),
      axis.text.y= element_blank(),
      axis.ticks.y = element_blank()
    )
  
  fp <- file.path(dir_outS, paste0(y, ".png"))# Define the file path where the plot will be saved
  
  ggsave(plot = p,  # Save the ggplot object as a PNG file
         filename = fp,   # File path where the plot will be saved
         device = "png", # Output device type (PNG format)
         width=3500,# Width of the output in pixels
         height = 1000, # Height of the output in pixels
         units = "px") # Units of width and height (pixels)
  
}# Close the loop


imgs <- list.files(dir_outS, full.names = TRUE) # List all file names in the directory 'dir_outS' and store them in 'imgs'
img_list <- lapply(imgs, image_read)  # Read each image file from the list of file names using 'image_read' and store them in 'img_list'

img_joined <- image_join(img_list) # Join the list of images into a single animated image using 'image_join'

img_animated <- image_animate(img_joined, fps = 1) # Create an animated image from the joined images, with a frame rate of 1 frame per second
image_write(image = img_animated,
            path = "outputs/Sherlock.gif")

img_animated

THE END